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Abstract 

Background: The expansion of polyglutamine (poly-Q) repeats in several unrelated proteins is associated with at 
least ten neurodegenerative diseases. The length of the poly-Q regions plays an important role in the progression 
of the diseases. The number of glutamines (Q) is inversely related to the onset age of these polyglutamine 
diseases, and the expansion of poly-Q repeats has been associated with protein misfolding. However, very little is 
known about the structural changes induced by the expansion of the repeats. Computational methods can 
provide an alternative to determine the structure of these poly-Q proteins, but it is important to evaluate their 
performance before large scale prediction work is done. 

Results: In this paper, two popular protein structure prediction programs, l-TASSER and Rosetta, have been used to 
predict the structure of the N-terminal fragment of a protein associated with Huntington's disease with 17 
glutamines. Results show that both programs have the ability to find the native structures, but l-TASSER performs 
better for the overall task. 

Conclusions: Both l-TASSER and Rosetta can be used for structure prediction of proteins with poly-Q repeats. 
Knowledge of poly-Q structure may significantly contribute to development of therapeutic strategies for poly-Q 
diseases. 



Background 

Knowledge of protein structure can be critical for devising 
therapeutic strategies for diseases in which protein dys- 
function contributes to pathogenesis. For the polygluta- 
mine (poly-Q) diseases, pathogenic poly-Q expansions 
typically cause gains of toxic functions associated with 
protein misfolding or aberrant interactions with RNAs or 
other proteins [1]. At least ten neurodegenerative disor- 
ders are caused by poly-Q expansions, including Hunting- 
ton's disease (HD), dentatorubral and pallidoluysian 
atrophy (DRPLA), spinal and bulbar muscular atrophy 
(SBMA), and the poly-Q spinocerebellar ataxias [2] 
(SCA1, SCA2, SCA3, SCA6, SCA7, SCA8, and SCA17) 
[3-5]. The proteins involved in these diseases have no sig- 
nificant sequence, compositional or structural homologies 
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[6,7] and numerous studies and observations have estab- 
lished that the length of the polyglutamine repeats plays a 
critical role in the progress and pathogenesis of these dis- 
eases [5,8]. Analysis from patients' data reveals that the 
expansion of polyglutamine repeats beyond certain patho- 
logical threshold causes the disease phenotype (Table 1) 
[9-12]. Also the number of the glutamines in the polyglu- 
tamine region is inversely correlated with age of onset 
[9,13-17]. For instance for SCA2, people with 32 or 33 
repeats tend to first experience symptoms of SCA2 in late 
adulthood, while people with more than 45 repeats usually 
have symptoms by their teens [2]. 

One possible mechanism for these diseases pathology is 
the assembly of unfolded protein monomers into [3 -sheet 
amyloid fibers [18]. Both in vivo and in vitro studies have 
shown that the poly-Q expansion may lead to protein 
misfolding [19] and may cause a structure transition to 
form parallel P -helix and P -sheet folds [20]. Protein mis- 
folding and aggregation has been shown to depend on 
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Table 1 Length of poly-Qs in polyglutamine diseases. 



Gene Wild-type allele Mutant allele 

repeat number repeat number 



SCA1 


ATXN1 


6-39 


41-83 


SCA2 


ATXN2 


13-31 


> = 32 


SCA3 


ATXN3 


12-43 


60-89 


SCA6 


CACNA1A 


<18 


20-33 


SCA7 


ATXN7 


<19 


36-460 


SCA8 


ATXN8 


1 5-50 a 


80-250 a 


SCA17 


TBP 


25-42 


49-66 


HD 


HTT 


10-26 


>40 


DRPLA 


ATN1 


6-35 


>48 


SBMA 


AR 


< =36 


>38 



a The non-coding gene ATXN80S gene may contribute to SCA8 pathogenesis. 
The number of CAG repeats in normal and mutant ATXN8 are assumed to be the 
same as the number of CTG repeats in normal and mutant ATXN80S, as in [4]. 



the poly-Q length and the concentration of the protein 
[21-23]. As shown in [24] the poly-Q tract will form 
P -sheet structures when the number of the Qs increases 
resulting in an increase of the chance of aggregation. 
Therefore the understanding of the effect of the length- 
ening of the poly-Q repeat segment on protein folding 
can provide new insights and perhaps therapies for these 
diseases. 

Although the association of the lengthening of the poly- 
Q repeats with the related polyglutamine diseases has 
been known for almost 20 years [25,26], high-resolution 
structural analysis of these proteins in their native context 
has eluded researchers [27] and only very limited experi- 
mental information exists. Kim has crystallized multiple 
structures of the N-terminal segment of huntingtin protein 
with 17 and 36 glutamines repeats [28,29], finding that the 
poly-Q regions exhibit conformational flexibility with 
a-helix, random coil, and extended loops [28,29]. These 
structures are the only crystal structures of poly-Q seg- 
ments available in the RCSB PDB database. Computational 
modeling can provide valuable insights to this problem 
[23,30,31], but to our knowledge no comprehensive stu- 
dies have been reported comparing the 3D structures pre- 
dicted for these segments with the limited experimental 
data available. 

The accuracy of the structures obtained using 3D 
structure prediction programs is improving rapidly, and 
some of the commonly available programs have shown 
excellent performance in the CASP competition [32]. 
However, all the 3D structure prediction programs are 
trained with a variety of proteins and their performance 
is usually evaluated on a general dataset [33]. There is no 
literature evidence reporting the performance of these 
programs on proteins containing poly-Q tracts. So it is 
necessary for us to evaluate the performance of these 
programs before we use them to predict the structure of 
polyglutamine disease proteins at large scale. 



In this paper we present our results of the evaluation 
of the prediction performance of two efficient and popu- 
lar 3D structure prediction programs, I-TASSER and 
Rosetta, on the N-terminal end of huntingtin protein 
with 17 glutamines (HTT17Q-EX1). 

Results 

Predicted models 

As evidence shows that the poly-Q region can adopt dif- 
ferent structures [28,29] in the proteins of interest for 
poly-Q diseases, it is not appropriate to seek the 'best 
structure' of this region, but it is more appropriate to 
look for ensembles of structures (generated by multiple 
independent runs) which can show overall trends and 
represent the variety of structures observed by experi- 
mental methods. 

Following this reasoning, both Rosetta and I-TASSER 
were run 10 times using different random seeds for 
each run of 3D structure prediction of the HTT17Q- 
EX1 sequence shown in Figure 1(b). For each run we 
kept the five best models, so a total of 50 I-TASSER 
models and 50 Rosetta models were retained for 
analysis. 

Each structure prediction program will return some 
parameters to estimate the accuracy of the models. For 
I-TASSER, the C-score, which lies in the (-5,2) range, is 
calculated for each model [34]. The C-scores of the best 
50 I-TASSER models, listed in Table 2, range from 
-2.62 to -4.72. 

The clustering algorithm from Rosetta was used to 
identify the most frequently sampled conformations. For 
each run we selected the five structures with the lowest 
energy from the structures encountered in the five dif- 
ferent clusters in which the number of structures was 
greater than 10 on each. The energies of the total 50 
Rosetta structures, listed in Table 3, they range from 
16.06 to 20.13. 

Secondary structure 

For better visualization, WebLogo [35] was used to dis- 
play secondary structure patterns. The WebLogo of the 
secondary structures of the experimental PDB structures 
and the best I-TASSER and Rosetta models are shown in 
Figure 2. For easy description, we divided the sequence 
into three regions: the 17-residue head region including 
residues 1 to 17; the poly-Q region including residues 18 
to 34 and C-terminal region including residues 35 to 60. 
As discussed in the original publication for the 21 PDB 
structures most crystals show a-helix in the head region, 
which is always well resolved, with only a few structures 
showing turns at the beginning and end of the head 
region. Both the I-TASSER and Rosetta best models 
reproduce the observed trends showing a majority of 
helix structures in the head region, but the I-TASSER 
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(a) 



MBP 


3A 


N-terminal 


Poly-Q 


Poly-P 


Poly-P/Q 


C-tag 


(17aa) 


(17aa) 


(llaa) 


(15aa) 


(19aa) 



>HTT17Q-EX1 

MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQPLLPQ 
Figure 1 The sequence construction of HTT17Q-EX1. (a) sequence structure of the PDB records; (b) sequence used for structure prediction. 



structures show better agreement with the experimental 
findings showing a preference for a-helix, while the 
Rosetta structures show a mix of a-helix and 3-helix. 
The secondary structure, for the resolved structures, in 
the poly-Q region is more diverse showing a number of 
structures with a-helix, random coils and turns. The 
Pro-enriched C-terminal region is dominated, at least for 
the resolved structures in this region, by coil structures. 
Unfortunately, as depicted in Figure 2(a), the number of 
well resolved structures rapidly decreases beyond the 
head region making comparison with the experiments 
less reliable. None-the-less the overall experimental 
trends are reproduced by both I-TASSER and Rosetta, 
but it appears that the I-TASSER structures show more 
loops than the experimental data. 

Overall I-TASSER appears to be superior reproducing 
quite well the stable a-helix structure of the N-terminal 
regions and showing increased diversity of structures in 
the poly-Q region and a predominance of coil structures 
in the C-terminal region. 

Reproducibility of I-TASSER and Rosetta results 

In order to test the sensitivity of I-TASSER and Rosetta 
with the selection of the seeds used in the calculations, 
we have calculated the structure similarity using the 
TM-score between models obtained using the same pre- 
diction program. A total of 1225 TM-scores were gener- 
ated comparing pairwise the best 50 I-TASSER and 50 
Rosetta models, respectively. 



TM-scores between any two models from I-TASSER 
range from 0.2781 to 0.7163, with an average of 0.4086 
and a standard deviation 0.0692. Whereas the TM-scores 
between any two Rosetta models range from 0.2865 to 
0.8236, with an average of 0.4979 and a standard deviation 
0.0892. The difference between TM-scores of I-TASSER 
and Rosetta is statistically significant (t-test, p < 0.001, 
Figure 3). The number of TM-scores greater than 0.5 is 
two times greater for Rosetta/Rosetta pairs than for 
I-TASSER/I-TASSER pairs, i.e. 561 pairs in Rosetta and 
126 pairs in I-TASSER have scores larger than 0.5. 

When comparing only the best models of each run, 
the TM-scores range from 0.4539 to 0.6813 for I-TAS- 
SER (Table 4) and from 0.2872 to 0.6879 for Rosetta 
(Table 5). Therefore the best models of each run from 
I-TASSER are more similar among themselves than 
those from Rosetta, i.e. 33 pairs of the 45 structure pairs 
have TM-scores greater than 0.5 for I-TASSER, whereas 
for Rosetta, only 18 pairs of best models have TM- 
scores greater than 0.5. 

The sensitivity to the selected random seeds was also 
evaluated at the run level. TM-scores were calculated 
for the structures of any 5 models in one run compared 
with any 5 models of other runs. The number of pairs 
with TM-score greater than 0.5 between any two experi- 
ments is shown in Table 6 for I-TASSER and Table 7 
for Rosetta. For I-TASSER, the number of pairs with 
TM-score greater than 0.5 ranges from 0 to 6. There 
are 6 pairs with TM-scores greater than 0.5 between 



Table 2 C-scores for the best I-TASSER models. 



Model 


# 1 


# 2 


# 3 


#4 


# 5 


Run 1 


-2.91 


-3.69 


-3.33 


-3.62 


-4.72 


Run 2 


-2.84 


-3.71 


-3.31 


-3.5 


-4.42 


Run 3 


-2.81 


-3.76 


-3.48 


-3.89 


-3.74 


Run 4 


-2.62 


-3.69 


-3.32 


-3.76 


-3.49 


Run 5 


-3.02 


-3.21 


-3.91 


-4.11 


-3.42 


Run 6 


-2.67 


-3.76 


-3.48 


-3.62 


-4.37 


Run 7 


-2.77 


-3.42 


-3.96 


-3.3 


-3.51 


Run 8 


-3.09 


-3.45 


-4.09 


-4.22 


-4.27 


Run 9 


-2.73 


-3.61 


-3.38 


-3.76 


-4.42 


Run 10 


-2.62 


-3.49 


-3.75 


-4.33 


-4.01 



Table 3 Energy for the best Rosetta models. 



Model 


# 1 


# 2 


# 3 


# 4 


# 5 


Run 1 


16.061 


16.349 


1 8.609 


1 9.656 


19.956 


Run 2 


17.881 


18.309 


18.373 


1 8.386 


19.215 


Run 3 


1 6.943 


17.598 


17.639 


1 8.306 


19.436 


Run 4 


18.414 


1 8.662 


18.691 


18.812 


19.076 


Run 5 


16.74 


1 8.004 


18.192 


18.3 


19.015 


Run 6 


18.353 


18.388 


18.572 


1 8.766 


18.96 


Run 7 


17.435 


18.897 


19.571 


1 9.603 


19.617 


Run 8 


18.128 


19.111 


19.521 


1 9.643 


19.707 


Run 9 


17.317 


1 7.586 


17.655 


17.916 


18.69 


Run 10 


19.329 


19.899 


19.928 


20.104 


20.13 
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Figure 2 Secondary structure WebLogo. (a) PDB structures; (b) l-TASSER models; (c) Rosetta models. In (a) M represents the number of 
structures with missing values due to lack of resolution in the experimental data. The codes for secondary structure are as follows; H: a-helix; B: 
p-bridge; E: Strand; G: Helix-3; I: Helix-5; T: Turn; S: Bend; M: Missing data. 



Run 4 and Run 7, however, no pairs with TM-scores 
greater than 0.5 between Run 1 and Run 8. For Rosetta, 
the number of pairs with TM-score greater than 0.5 at 
run level ranges from 5 to 20. 20 of 25 pairs are with 
TM-scores greater than 0.5 between Run 3 and Run 7, 
which is the best. The smallest number of pairs for 
Rosetta is 5, which shows in 3 pairs, Run 1 and Run 6, 
Run 6 and Run 8, Run 5 and Run 8. 

These results show that our ensemble approach to pre- 
dict the structure of proteins associated with poly-Q dis- 
eases appears to be appropriate. Using multiple seeds it is 
possible to obtain an ensemble of structures that show 
reasonable diversity, but still retain the main features. 
We believe that this approach is quite promising because 
it can incorporate in future analysis the diverse structure 
of which appears to be an emerging observation from the 
limited experimental data on these proteins. 



Validity evaluation of l-TASSER and Rosetta 

As depicted in Figure 2(a) not all of the 21 PDB struc- 
tures have been resolved in the poly-Q region, which is 
our main interest. For instance, the longest well resolved 
poly-Q region is the B chain of the 3IOW [PDB: 3IOW] 
structure in which all the 17 Qs structures are resolved, 
whereas for the A chain of the 3IOT [PDB: 3IOT] struc- 
ture only one Q has been resolved. Also, there are 
numerous gaps in several structures as some of the resi- 
dues are not resolved. Taking this into account and in 
order to make an accurate comparison with the experi- 
mental ones in the region of interest, only PDB struc- 
tures in which at least 9 (more than half the total 
number) of consecutive Qs in the poly-Q region show 
well resolved structures were used for the evaluation of 
the results produced with I-TASSER and Rosetta. There 
are ten PDB structures that meet this criteria: the B 
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Figure 3 Distribution of TM-scores of any two models from I- 
TASSER and Rosetta respectively 



Table 6 Number of pairs with TM-score greater than 0.5 
between any two runs of l-TASSER. 



Table 4 TM-scores between the best models from 



TASSER. 


run 


# 2 


# 3 


# 4 


# 5 


# 6 


# 7 


# 8 


# 9 


# 10 


# 1 


0.48 


0.48 


0.63 


0.48 


0.54 


0.57 


0.49 


0.51 


0.46 


# 2 




0.56 


0.54 


0.55 


0.48 


0.54 


0.45 


0.58 


0.46 


# 3 






0.54 


0.63 


0.53 


0.51 


0.54 


0.56 


0.51 


#4 








0.50 


0.52 


0.66 


0.49 


0.56 


0.49 


# 5 










0.60 


0.51 


0.58 


0.59 


0.54 


# 6 












0.51 


0.68 


0.54 


0.49 


# 7 














0.50 


0.57 


0.45 


# 8 
















0.50 


0.52 


# 9 


















0.50 


Table 5 TM-scores between best models from Rosetta. 


run 


# 2 


# 3 


# 4 


# 5 


# 6 


# 7 


# 8 


# 9 


# 10 


# 1 


0.30 


0.43 


0.37 


0.42 


0.42 


0.53 


0.40 


0.38 


0.43 


# 2 




0.30 


0.34 


0.31 


0.39 


0.28 


0.34 


0.32 


0.38 


# 3 






0.52 


0.53 


0.53 


0.65 


0.47 


0.57 


0.50 


#4 








0.48 


0.53 


0.47 


0.68 


0.49 


0.43 


# 5 










0.63 


0.52 


0.48 


0.58 


0.61 


# 6 












0.52 


0.47 


0.64 


0.59 


# 7 














0,'l 1 


0.56 


0.55 


# 8 
















0.46 


0.42 


# 9 


















0.60 



chain of 3104 [PDB: 3104] (3io4_b), the C chain of 
3104 [PDB: 3104] (3io4_c), the B chain of 3106 [PDB: 
3106] (3io6_b), the C chain of 3106 [PDB: 3106] 
(3io6_c), the C chain of 3IOR [PDB: 3IOR] (3ior_c), the 



#2 #3 #4 #5 #6 #7 #8 #9 



# 10 



# 1 

# 2 

# 3 

# 4 

# 5 

# 6 

# 7 

# 8 

# 9 



Table 7 Number of pairs with TM-score greater than 0.5 
between any two runs of Rosetta. 



run 


# 1 


# 2 


# 3 


# 4 


# 5 


# 6 


# 7 


# 8 


# 9 


# 10 


# 1 


4 


6 


10 


11 


8 


5 


11 


12 


8 


9 


# 2 




2 


11 


8 


7 


6 


13 


11 


11 


10 


# 3 






6 


15 


9 


13 


20 


16 


18 


16 


# 4 








3 


9 


7 


14 


13 


16 


14 


# 5 










4 


8 


15 


5 


16 


15 


# 6 












1 


8 


5 


13 


9 


# 7 














5 


15 


16 


14 


# 8 
















8 


7 


13 


# 9 


















9 


17 


#10 




















6 



Table 8 Numbers of Qs in the PDB structures. 



PDB structure 


Number of Qs 


3io4_b 


10 


3io4_c 


1 1 


3io6_b 


11 


3io6_c 


10 


3ior_c 


13 


3iot_b 


12 


3iou_c 


14 


3iov_b 


11 


3iov_c 


15 


3iow_b 


17 



B chain of 3IOT [PDB: 3IOT] (3iot_b), the C chain of 
3IOU [PDB: 3IOU] (3iou_c), the B chain of 3IOV [PDB: 
3IOV] (3iov_b), the C chain of 3IOV [PDB: 3IOV] 
(3iov_c), and the B chain of 3IOW [PDB: 3IOW] 
(3iow_b). The number of consecutive Qs in each struc- 
ture is shown in Table 8. 

The best 50 I-TASSER and 50 Rosetta models were 
compared with these 10 PDB structures using the TM- 
align program. TM-scores, root-mean-square deviation 
(RMSD), aligned number of residues, sequence identity 
and the structure superposition were obtained from 
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TM-align [36]; the number of exact matches and the 
number of exact matched Qs were extracted from the 
structure alignment and finally the exact structure over- 
lap (ESO) and exact structure overlap of Qs (ESOP) 
were calculated using equation (2) and equation (3) 
given in the methods section. The values of each simi- 
larity parameter considered here are shown in Table 9 
along with the p-values assessing the significance of the 
difference between the I-TASSER and Rosetta results. 

The average TM-score of I-TASSER/PDB superposi- 
tion pairs is 0.50 and the average TM-score of Rosetta/ 
PDB pairs is 0.45, reflecting the fact that 253, of the 
500, I-TASSER/PDB pairs have TM-scores greater than 
0.5 while only 87 pairs of the Rosetta/PDB pairs have 
TM-scores greater than 0.5. The average RMSD of 
I-TASSER/PDB pairs (1.53 A) is also smaller than that 
of Rosetta/PDB pairs (1.74 A). Other TM-align para- 
meters depicted in Table 9 also show that I-TASSER 
performs better than Rosetta in this test. 

The structure overlap scores, ESOP and ESO, for 
I-TASSER models are also better than those for Rosetta 
models. For instance more than 75% of the Rosetta 
models have no exact match in the poly-Q region nor 
for the entire sequence, whereas the 75% quantile of the 
ESO and ESOP scores for I-TASSER are 53.13 and 9.09, 
respectively. The statistical tests have shown that these 
differences are significant (Table 9). 

Fifty of the I-TASSER/PDB structure superpositions 
have ESOP values greater than or equal to 50, which 
means that 50 pairs have more than 50% of Qs in the 
poly-Q region with exact match. These 50 pairs include 



Table 9 Distribution of structure superposition 
parameters between predicted models and PDB 
structures 





I-TASSER 


Rosetta 


p-value 


TM-score 


0.50 ± 0.06 


0.45 ± 0.06 


<0.0001 


RMSD (A) 


153 ± 0.34 


1 .74 ± 0.34 


<0.0001 


Aligned number 


24.05 ± 2.14 


25.56 ± 2.41 


<0.0001 


Sequence Identity a 


(0.30,071) 


(0.38,0.52) 


<0.0001 


Exact Match (<5.0 A) a 


(0,16) 


(0,0) 


<0.0001 


Exact Qs Match(<5.0 A) a 


(0,1) 


(0,0) 


<0.0001 


Total Qs Match(<5.0 A) a 


(5,8) 


(8,11) 


<0.0001 


Exact Match (other) a 


(0,0) 


(0,0) 


<0.0001 


Exact Qs Match(other) a 


(0,0) 


(0,0) 


<0.0001 


Total Qs Match(other) a 


(0,1) 


(0,1) 


<0.0001 


Exact Match (all) a 


(6,25) 


(0,0) 


<0.0001 


Exact Qs Match (all) a 


(0,1) 


(0,0) 


<0.0001 


Total Qs Match (all) a 


(6,8) 


(9,12) 


<0.0001 


ESOP a 


(0,9.09) 


(0,0) 


<0.0001 


ESO a 


(0, 53.13) 


(0,0) 


<0.0001 



a The values between brackets represent the value of the property for the 
best structure superposition at the first and third quartile, respectively, of their 
distributions. 



9 of the 10 PDB structures, so 9 of the 10 structures 
have corresponding I-TASSER models with very good 
matches in the poly-Q regions. In contrast only 5 of 
these 10 structures have corresponding Rosetta/PDB 
structure superposition matches when the same criteria 
are used. 

The best matches between the predicted structures by 
I-TASSER and Rosetta, respectively, and one of the PDB 
structures considered here are depicted in Figure 4. The 
I-TASSER structure best match is with the B chain of 
3106 [PDB: 3106]; the match has a TM-score of 0.56 
and the ESOP score of 100. The best two matches for 
Rosetta structures show matches with the C chain of 
3IOU [PDB: 3IOU] and the B chain of 3IOW [PDB: 
3IOW]. Their TM-scores are 0.5074 and 0.5057, respec- 
tively, and the ESOP score of 100. 

Discussion 

This study evaluated two software tools for predicting 
from amino acid sequences, the 3D structures of the 
poly-Q regions of proteins related to polyglutamine dis- 
eases. Pathogenic neurodegenerative poly-Q proteins 
were used as a model, for relevance to developing struc- 
ture-specific therapeutics based on normal vs. poly-Q 
expanded protein structures. Two highly recognized and 
efficient 3D structure prediction programs, I-TASSER 
and Rosetta, were evaluated to assess their performance 
for structure prediction using segments of the hunting- 
tin protein harboring poly-Q repeats. Both I-TASSER 
and Rosetta produced good results. 

When tested for structure stability under changes of 
the initial random seed, Rosetta shows less variability 
than I-TASSER. This means that if we run Rosetta and 
I-TASSER several times respectively, it is possible that 
we will get less variance in the results from Rosetta than 
from I-TASSER. None-the-less, both programs produce 
a reasonable ensemble of structures with sufficient 
diversity and without extreme deviations. Several studies 
have illustrated that the poly-Q repeat regions of these 
proteins are highly disordered with structure flexibility 
[31], but this has not been quantified experimentally. 
Therefore it is challenging to discriminate among these 
two approaches using these criteria. In consequence we 
must conclude that both I-TASSER and Rosetta are sui- 
table for the task on predicting ensemble structures of 
protein containing poly-Q segments. 

The accuracy of the prediction program is a very 
important factor that we evaluated here. In this study, 
the structure similarity between the predicted models 
and the PDB experiment structures available was used 
to evaluate the validity of the prediction programs. The 
root-mean-square deviation (RMSD) score is the most 
often-used parameter to calculate the structure similar- 
ity, but a drawback of its use is that a relatively small 
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(c) 



Figure 4 Structure superposition of predicted models and PDB structures. Structure superposition of predicted models and PDB structures 
with TM-score>0.5 and ESOP = 100. (a) l-TASSER third model of the tenth run with 3io6_b; (b) Rosetta forth model in first run with 3iou_c; 
(c) Rosetta third model in fifth run with 3iow_b- tan: PDB structure, sky-blue: predicted models. The N-terminal end of each structure is shown 
on the left. 



local variation can result in a high RMSD [37]. 
TM-score weights the close atom pairs stronger than 
the distant matches, and it is more sensitive to the 
topology fold than the RMSD [37]. Besides the global 
similarity measured by TM-score, more restricted scores 
on the exact match of two structures were also calcu- 
lated. The exact structure overlap (ESO), derived from 
the structure overlap (SO) score [38], was introduced 
and instead of calculating the number of aligned pairs it 
counts the exact match pairs, which not only counts 
aligned residues but also residues that lie in the same 
positions in both the sequences of predicted model and 
PDB structure. The exact structure overlap of poly-Q 



repeat (ESOP) is the special version of ESO, which is 
used to measure the prediction accuracy in the poly-Q 
region. Considering the TM-score, ESO and ESOP 
together gives a more comprehensive view of similarity 
between the predicted model and the PDB experimental 
structures from both a global and a local aspect. The 
ESO score and ESOP score can be used for similarity 
comparison tasks, especially if there are regions which 
play more critical roles than others. 

Rosetta models have a larger number of aligned resi- 
dues on average than I-TASSER, but the average RMSD 
values and TM-scores are much higher (lower) than 
that of I-TASSER. So when the Rosetta models are 
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aligned with the PDB structures, the distance between 
the models and the experimental structures is large, 
which is not a good sign for good structural matches. 
On the contrary, I-TASSER models aligned better with 
PDB structures not only with better RMSD and 
TM-scores, but also better ESO and ESOP scores. This 
can also be seen from the secondary structures patterns. 
When considering specific structure pairs, both I-TAS- 
SER and Rosetta have predicted models which can 
match the PDB structures with good global (TM>0.5) 
and local (ESO>=50 and ESOP>=50) structures. So both 
Rosetta and I-TASSER have the ability to get the native 
models, but for the overall performance, I-TASSER 
appears to be better than Rosetta. 

As several models are returned by the structure predic- 
tion programs, it is important to have criteria to select the 
best models. However, the model with the lowest energy 
in the prediction program may not be the best model for 
reproducing the poly-Q regions. For instance for Rosetta, 
the two predicted models with TM-score greater than 0.5 
and ESOP of 100 (Figure 4(b) and 4(c)) are not the models 
with the lowest energy in that Rosetta run. This is true 
also for the I-TASSER model with TM-score greater than 
0.5 and ESOP of 100 (Figure 4(a)). In fact, of the 29 good 
models which have TM-score greater than 0.5 and ESOP 
score greater than 50, only one model is ranked as the 
best by I-TASSER. 

Conclusions 

Both I-TASSER and Rosetta can be used for in silico 
studies of the structures of proteins with poly-Q repeats 
related to neurodegenerative diseases. However, I-TAS- 
SER shows better performance than Rosetta when con- 
sidering the overall agreement between results produced 
using these two prediction models with the limited 
experimental results available for comparison. 

Both I-TASSER and Rosetta are computationally effi- 
cient as both applications can be easily parallelized by 
executing numerous jobs each with a unique random seed. 

In our future studies we will attempt to predict the 
change of the structure as function of the number of Qs in 
the poly-Q repeat segment for all the proteins involved in 
poly-Q neurological diseases. Ideally we could use both 
these two programs to predict structures of the poly-Q 
disease related proteins. This could provide a quasi "crowd- 
sourcing" mechanism to cross check the results, but may 
prove computationally too expensive (see Methods). 
Therefore the results presented here suggest that studies 
should be, at least initially, performed using I-TASSER. 

Methods 

Poly-Q segments 

We searched the RCSB PDB database [39] for structures 
with more than 10 consecutive glutamines in their 



sequences on November 2012. A total of 11 structures 
were retrieved, including 7 of the first exon of the hun- 
tingtin protein with 17 glutamines (HTT17Q-EX1) [28] 
and 4 of the first exon of huntingtin protein with 36 
glutamines (HTT36Q-EX1) [29]. Figure 1(a) shows the 
sequence construction for the X-Ray diffraction experi- 
ment on HTT17Q-EX1 which was expressed and crys- 
tallized as a maltose-binding (MDP) fusion protein [28]. 
The same methods were used to get the crystal struc- 
ture of HTT36Q-EX1, but the resolution of the 
HTT36Q-EX1 is of such poor quality that only 
HTT17Q-EX1 structures were used in this study. 

PDB identification numbers of the 7 HTT17Q-EX1 crys- 
tal structures used here are 3104 [PDB: 3104], 3106 [PDB: 
3106], 3IOT [PDB: 3IOT], 3IOU [PDB: 3IOU], 3IOR 
[PDB: 3IOR], 3IOV [PDB: 3IOV] and 3IOW [PDB: 3IOW]. 
Each crystal includes a trimer of MDP-HTT17Q-EX1, so a 
total of 21 structures of HTT17Q-EX1 were considered. 
Figure 1(b) shows the sequence of the HTT17Q-EX1 used 
as the input of the 3D structure prediction. 

Protein 3D structure prediction 

Two protein structure prediction programs were used in 
this study, I-TASSER and Rosetta. Both I-TASSER and 
Rosetta have been used by thousands users and they are 
among the few programs which can handle large pro- 
teins with more than 1000 residues [34,40]. 

I-TASSER is the 3D structure prediction program based 
on multiple-threading alignments and iterative template 
fragment assembly simulations [41]. I-TASSER is a fully 
automated method and was used without further modifi- 
cations, but we have verified that none of the templates 
corresponding to the structures 3104 [PDB: 3104], 3106 
[PDB: 3106], 3IOT [PDB: 3IOT], 3IOU [PDB: 3IOU], 
3IOR [PDB: 3IOR], 3IOV [PDB: 3IOV] and 3IOW [PDB: 
3IOW] was included in the knowledge data used in the 
version of I-TASSER used here. Rosetta is a flexible soft- 
ware suite for macromolecular modeling, which includes 
tools for structure prediction and design [42] . Rosetta ab 
initio module was used in this study. For Rosetta, the 
quota protocol fragment picking was used to generate 3- 
mers and 9-mers fragments, which took into account the 
secondary structure predictions by PsiPred [43], Jufo9D 
Server [44] and SAM-T08 [45] as the quota pools. The 
weight given to the each quota pool was assigned follow- 
ing reference [46] and 200 fragments were picked from 
the total of 700 candidates available from both 3-mers and 
9-mers fragments. The default parameters were used for 
Rosetta ab initio modelling with the number of output 
structures set as 5000, the default parameters also were 
used for Rosetta cluster module. 

We installed I-TASSER Version 2.1 and Rosetta Ver- 
sion 3.4 in a cluster at the Center for High Performance 
Computing (CHPC) of University of Utah, where all 
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computations were performed. As a fully automated pro- 
gram, the number of decoys to screen and the number of 
simulation jobs in I-TASSER are fixed, whereas Rosetta is 
much more flexible and users can define the output 
number of structures and the number of parallel simula- 
tion jobs, making it much more adaptable to the hard- 
ware architecture used. So it is difficult to compare the 
computational cost of the two programs. However, for 
the modelling tasks with the parameters used in our 
simulation, the total CPU time for I-TASSER to finish 
one HTT17Q-EX1 (60 amino acid residues) prediction 
was, in average, 24.58 hours using one core in a 2.4 GHz 
dual-core Opteron processor, whereas the average total 
CPU time for Rosetta to finish one HTT17Q-EX1 predic- 
tion with 5000 prediction structures was about 50.91 
hours in the same computing environment. 

3D structure alignment 

To assess 3D structure similarity, TM-align was used for 
structure comparison and alignment [36]. The TM-score 
calculated by TM-align, which lies in (0,1] interval, is 
considered a good measure of the similarity of two 
structures [37]. A TM-score of less than 0.17 indicates a 
random alignment, whereas TM-score greater than 0.5 
indicates that the two structures are generally in the 
same fold [37]. 

Similarity measurement 

Besides the TM-score, exact structure overlap (ESO) and 
exact structure overlap of poly-Qs (ESOP) were also used 
to measure the similarity of two structures. The words 
'exact' here means the aligned residues are within certain 
threshold, 5A in this study, and that they are the same 
residue in the HTT17Q-EX1 sequence. For example, if a 
serine (SER) in the 16th position of the predicted structure 
of HTT17Q-EXlis aligned, within the distance threshold, 
with the serine (SER) in the 16th position of PDB experi- 
mental structure, the 16SER-16SER is an exact match. 
ESO and ESOP is derived from the Structure Overlap 
(SO) which is a standardized score to compare the struc- 
ture alignments and measure the local similarity of two 
structures [38]. The SO score is calculated as: 



SO = 100 x 



L{A) 



min (Lm, Le) 



(1) 



where L(A) is the structure alignment length; the Lm 
and Le are the length of the predicted model and the 
experimental structure, respectively. 

We have modified Equation (1) to meet the aim of 
more strict structure comparison, and get the ESO score: 



where L(EA) is the length of exact match; Lm and Le 
is the length of predicted model and the length of the 
PDB experimental structure respectively. 

The structure of poly-Q region may play a more 
important role than other positions. In this study, the 
ESOP score is calculated to evaluate the structure simi- 
larity of the poly-Q regions. The ESOP is a special ver- 
sion of ESO, and it is calculated as: 



ESOP = 100 x 



L(EAQ) 



min (LQm, LQe) 



(3) 



where L(EAQ) is the length of the exact match of Qs; 
LQm and LQe are the length of poly-Q in predicted 
model and PDB experimental structure respectively. 

Secondary structure calculation 

The secondary structure of the predicted models and 
the PDB experimental structures were calculated using 
the DSSP algorithm, which is an algorithm to standar- 
dize secondary structure assignment [47]. Secondary 
structures assigned by DSSP are 8 conformational states, 
including a-helix, |3 -bridge, strand, 3-helix, 5-helix, turn, 
bend, and random coil. 

The results of DSSP are the secondary structures 
represented by one letter for each position. In order to 
get a better view of the results, 'WebLogo 3 ' [35] was 
used to plot the secondary structure logo at each posi- 
tion. The overall height of the stack indicates the sec- 
ondary structure conservation at that position, and the 
height of the symbols within the stack indicates the rela- 
tive frequency of each secondary structure type at that 
position. 

3D structure visualization 

The 3D structure and the 3D structure superposition 
were visualized in the UCSF Chimera software, a free 
program for molecular graphics and analysis [48]. 

Statistics 

To depict the data distribution of the parameters calcu- 
lated here, the (mean value ± standard deviation) is 
listed for data with normal distribution, whereas for 
data that do not follow the normal distribution, the 25% 
quantile and 75% quantile values are listed. 

The Student t test was applied for data with normal 
distribution and the Wilcoxon ranked test was per- 
formed on other data sets to assess significance. The 
significant level was set at 0.05. All the statistic work 
was done in the R environment which is a free software 
environment for statistical computing and graphics [49]. 



(2) 
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